By: Tim Kartawijaya (tak2151)

library(tidyverse)
options(scipen = 999) # turn off scientific notation like 9e+05
theme_set(theme_bw()) # set the theme for all plots

1. Flowers

Data: flower dataset in cluster package

First, we load in the data.

library(cluster)
flower_df = flower
  1. Rename the column names and recode the levels of categorical variables to descriptive names. For example, “V1” should be renamed “winters” and the levels to “no” or “yes”. Display the full dataset.
#rename column names
names(flower_df) = c("winters","shadow","tubers","color","soil","preference","height","distance")

#use revalue levels
levels(flower_df$winters) = c("No", "Yes")
levels(flower_df$shadow) = c("No", "Yes")
levels(flower_df$tubers) = c("No", "Yes")
levels(flower_df$color) = c("white","yellow","pink","red","blue")
levels(flower_df$soil) = c("dry","normal","wet")

#display full dataset
flower_df
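As an alternative, the renaming and recoding can be done in one tidyverse pipeline with `dplyr::rename` and `forcats::fct_recode`. This is only a sketch; it assumes the binary factors are coded "0"/"1" and the multi-level factors "1".."5", as in the cluster package documentation.

```r
library(forcats)

flower_df <- flower %>%
  rename(winters = V1, shadow = V2, tubers = V3, color = V4,
         soil = V5, preference = V6, height = V7, distance = V8) %>%
  mutate(winters = fct_recode(winters, No = "0", Yes = "1"),
         shadow  = fct_recode(shadow,  No = "0", Yes = "1"),
         tubers  = fct_recode(tubers,  No = "0", Yes = "1"),
         color   = fct_recode(color, white = "1", yellow = "2",
                              pink = "3", red = "4", blue = "5"),
         soil    = fct_recode(soil, dry = "1", normal = "2", wet = "3"))
```

Recoding by explicit old-level name (rather than by position, as `levels<-` does) guards against silently mislabeling levels if the factor's level order differs from what we expect.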
  2. Create frequency bar charts for the color and soil variables, using best practices for the order of the bars.

Below we provide two styles of bar graph: the first treats the variable as nominal, sorting the bars by frequency to highlight the most common value; the second treats it as ordinal, keeping the bars in their natural order.

ggplot(data = flower_df, mapping = aes(x = soil)) + 
  geom_bar(color = "black", fill = "lightblue") + 
  scale_x_discrete(limits = c("normal","wet","dry")) + 
  labs(x = "Soil", y = "Frequency", title = "Top 3 Soil Types of Popular Flowers")

ggplot(data = flower_df, mapping = aes(x = soil)) + 
  geom_bar(color = "black", fill = "lightblue") +  
  labs(x = "Soil", y = "Frequency", title = "Top 3 Soil Types of Popular Flowers")

ggplot(data = flower_df, mapping = aes(x = color)) + 
  geom_bar(color = "black", fill = "lightblue") + 
  scale_x_discrete(limits = c("red","yellow","pink","blue","white")) + 
  labs(x = "Color", y = "Frequency", title = "Top 5 Colors of Popular Flowers")
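Instead of hardcoding the level order in `scale_x_discrete()`, the frequency sort can be computed from the data itself with `forcats::fct_infreq`. A sketch (wrap the call in `fct_rev()` if ascending order is preferred):

```r
library(forcats)

# fct_infreq() reorders the factor levels by decreasing count,
# so the bars come out sorted without listing levels by hand
ggplot(data = flower_df, mapping = aes(x = fct_infreq(color))) +
  geom_bar(color = "black", fill = "lightblue") +
  labs(x = "Color", y = "Frequency", title = "Top 5 Colors of Popular Flowers")
```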

2. Minneapolis

Data: MplsDemo dataset in carData package

First, we load in the data.

library(carData)
mp_df = MplsDemo
  1. Create a Cleveland dot plot showing estimated median household income by neighborhood.
ggplot(data = mp_df, mapping = aes(x = hhIncome, y = reorder(neighborhood, hhIncome))) + 
  geom_point(color = "blue") + 
  labs(x = "Median Household Income ($)", y = "Neighborhood", title = "Median Household Income by Neighborhood")

  2. Create a Cleveland dot plot with multiple dots to show percentage of 1) foreign born, 2) earning less than twice the poverty level, and 3) with a college degree by neighborhood. Each of these three continuous variables should appear in a different color. Data should be sorted by college degree.
ggplot(data = mp_df, mapping = aes(y = reorder(neighborhood, collegeGrad))) + 
  geom_point(mapping = aes(x = collegeGrad), color = "purple") + 
  geom_point(mapping = aes(x = poverty), color = "black") + 
  geom_point(mapping = aes(x = foreignBorn), color = "dark orange") + 
  labs(x = "Proportion (%)", y = "Neighborhood", title = "Foreign Born, In Poverty, and College Grad Proportion by Neighborhood")
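Because the three `geom_point()` layers above hardcode their colors, ggplot draws no legend. A sketch of an alternative that pivots the data to long format so the color mapping (and legend) come from a single variable; it assumes tidyr >= 1.0 for `pivot_longer()`:

```r
library(tidyr)
library(forcats)

mp_long <- mp_df %>%
  # fix the neighborhood order by college-grad proportion before pivoting
  mutate(neighborhood = fct_reorder(neighborhood, collegeGrad)) %>%
  pivot_longer(c(collegeGrad, poverty, foreignBorn),
               names_to = "measure", values_to = "proportion")

ggplot(mp_long, aes(x = proportion, y = neighborhood, color = measure)) +
  geom_point() +
  labs(x = "Proportion", y = "Neighborhood",
       title = "Foreign Born, In Poverty, and College Grad Proportion by Neighborhood")
```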

  3. What patterns do you observe? What neighborhoods do not appear to follow these patterns?

We can infer these following neighborhood patterns from the graphs above:

  1. The lower the proportion of people with a college degree (< 25%), the higher the proportion of foreign born (>20%). The exceptions to this pattern are neighborhoods like Seward, Lyndale, and Downtown East.

  2. When there aren’t many people with college degrees (< ~25%), the poverty proportion is generally above 10%, except in some neighborhoods such as Holland and Phillips West.

  3. The higher the median household income, the higher the proportion of college graduates. We can see this more clearly in the additional Cleveland dot plot below, which shows a positive linear relationship between the proportion of college graduates and median household income. Exceptions to this pattern are neighborhoods like Prospect Park and East River Road.

ggplot(data = mp_df, mapping = aes(y = reorder(neighborhood, collegeGrad))) + 
  geom_point(mapping = aes(x = hhIncome)) + 
  labs(x = "Median Household Income ($)", y = "Neighborhood", title = "Median Household Income by Neighborhood ordered by College Grad Proportion")
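The claimed positive relationship between income and education can also be quantified with a correlation coefficient (a sketch):

```r
# correlation between median household income and college-graduate proportion
cor(mp_df$hhIncome, mp_df$collegeGrad)
```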

3. Taxis

Data: NYC yellow cab rides in June 2018, available here:

http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

It’s a large file so work with a reasonably-sized random subset of the data.

First, let’s import the data from the raw csv file.

cab = read_csv("yellow_tripdata_2018-06.csv")

Next, let’s draw a random sample of 10,000 values.

set.seed(12)
cab_sample = cab[sample(nrow(cab),10000),]
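The same sample can be drawn with dplyr (assumes dplyr >= 1.0, where `slice_sample()` superseded `sample_n()`):

```r
set.seed(12)
cab_sample <- cab %>% slice_sample(n = 10000)
```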

Draw four scatterplots of tip_amount vs. fare_amount with the following variations:

  1. Points with alpha blending
ggplot(data = cab_sample, mapping = aes(x = tip_amount, y = fare_amount)) + 
  geom_point(alpha = 0.3, color = "black") + 
  labs(x = "Tip Amount", y = "Fare Amount", title = "Tip Amount vs. Fare Amount")

  2. Points with alpha blending + density estimate contour lines
ggplot(data = cab_sample, mapping = aes(x = tip_amount, y = fare_amount)) + 
  geom_point(alpha = 0.3, color = "black") + 
  labs(x = "Tip Amount", y = "Fare Amount", title = "Tip Amount vs. Fare Amount") + 
  geom_density_2d()

The density contour plot is a little difficult to see here since there are a significant number of values clustered in the lower left corner of the plot.

  3. Hexagonal heatmap of bin counts
ggplot(data = cab_sample, mapping = aes(x = tip_amount, y = fare_amount)) + 
  labs(x = "Tip Amount", y = "Fare Amount", title = "Tip Amount vs. Fare Amount") + 
  geom_hex(binwidth = c(5,35))

  4. Square heatmap of bin counts
ggplot(data = cab_sample, mapping = aes(x = tip_amount, y = fare_amount)) + 
  labs(x = "Tip Amount", y = "Fare Amount", title = "Tip Amount vs. Fare Amount") + 
  stat_bin2d(binwidth = c(2,16))

For all, adjust parameters to the levels that provide the best views of the data.

  5. Describe noteworthy features of the data, using the “Movie ratings” example on page 82 (last page of Section 5.3) as a guide.

There are many insights that can be derived from the plots above:

  1. Most fare values are below $50.
  2. Most tip values are below $15.
  3. There is an approximately positive linear relationship between tip and fare; for example, few tips exceed $10 when the fare is around $25.
  4. There is a cluster of fares at exactly $50, which are probably flat fares to the airports.
  5. Tip values cluster at discrete amounts, since people who tip in cash tend to use common US dollar denominations: $1, $5, and $10.
  6. Many people don’t tip at all, as shown by the large number of zero tip values.
  7. There are outliers in both directions: one customer paid a fare of about $230 with no tip, while another gave a $40 tip on a flat airport fare.
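The claims about discrete and zero tips can be checked directly by tabulating tip values in the sample (a sketch):

```r
# most common tip amounts; round-dollar values should dominate
cab_sample %>%
  count(tip_amount, sort = TRUE) %>%
  head(10)

# share of rides with no tip at all
mean(cab_sample$tip_amount == 0)
```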

4. Olive Oil

Data: olives dataset in extracat package

library(extracat)
olives_df = olives
  1. Draw a scatterplot matrix of the eight continuous variables. Which pairs of variables are strongly positively associated and which are strongly negatively associated?

From the plot below, the pairs of variables that have a strongly positive association are:

  1. Palmitic with Palmitoleic
  2. Palmitoleic with Linoleic

Those with a weaker positive association:

  1. Palmitic with Linoleic
  2. Linoleic with Eicosenoic
  3. Arachidic with Eicosenoic

Conversely, the pairs of variables that have a strongly negative association are:

  1. Palmitic with Oleic
  2. Palmitoleic with Oleic
  3. Oleic with Linoleic
pairs(olives_df[,3:10])
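These visual judgments can be cross-checked numerically with a correlation matrix over the same columns:

```r
# pairwise correlations for the eight fatty-acid variables,
# rounded for readability
round(cor(olives_df[, 3:10]), 2)
```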

  2. Color the points by region. What do you observe?

From the colored scatterplot matrix below, we can observe several things:

  1. The eicosenoic levels of olives from the South region are generally greater than those from the other regions.
  2. The measurements from the North region are more spread out than those from the other regions.
  3. The measurements from Sardinia and the South generally form tight clusters and do not vary as much as those from the North.
pairs(olives_df[,3:10], col = olives_df$Region)

#create legend outside the plot region
par(xpd = TRUE)
legend(-0.05, 1.05, fill = seq_along(levels(olives_df$Region)),
       legend = levels(olives_df$Region))

5. Wine

Data: wine dataset in pgmm package

(Recode the Type variable to descriptive names.)

  1. Use parallel coordinate plots to explore how the variables separate the wines by Type. Present the version that you find to be most informative. You do not need to include all of the variables.

First, we load the data.

library(pgmm)
data(wine)

Then, we create the parallel coordinate plot using the variables on which the three wine types are most clearly separated. In this case, the variables used are:

  1. Alcohol
  2. Total Phenols
  3. Flavanoids
  4. Color Intensity
  5. Proline
library(GGally)
wine$Type = factor(wine$Type) # convert the numeric Type codes to a factor for grouping
ggparcoord(wine, columns = c(2,16,17,20,27), groupColumn = 'Type', scale = 'std', alpha = 0.55)
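Selecting the columns by name rather than by index is less fragile if the column order ever changes. A sketch; the names below are assumed to match those in pgmm's wine data exactly, so verify them with `names(wine)` first:

```r
# look up column positions by name instead of hardcoding indices
vars <- c("Alcohol", "Total Phenols", "Flavanoids", "Color Intensity", "Proline")
ggparcoord(wine, columns = which(names(wine) %in% vars),
           groupColumn = "Type", scale = "std", alpha = 0.55)
```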

  2. Explain what you discovered.

From the graph above, there are several insights that we can take away:

  1. Type 1 generally has the highest values for Alcohol, Total Phenols, Flavanoids, and Proline.
  2. Type 2 has the lowest Alcohol, Color Intensity, and Proline values.
  3. Type 3 has the lowest Total Phenols and Flavanoids, but the highest Color Intensity.